Recognizing Objects And Reasoning About Their Interactions
The task of scene understanding involves recognizing the different objects present in the scene, segmenting the scene into meaningful regions, as well as obtaining a holistic understanding of the activities taking place in the scene. Each of these problems has received considerable interest within the computer vision community. We present contributions to two aspects of visual scene understanding.
First, we explore multiple methods of feature selection for the problem of object detection. We demonstrate the use of Principal Component Analysis to detect avifauna in field observation videos. We improve on existing approaches by making robust decisions based on regional features and by a feature selection strategy that chooses different features in different parts of the image. We then demonstrate the use of Partial Least Squares to detect vehicles in aerial and satellite imagery. We propose two new feature sets: Color Probability Maps capture the color statistics of vehicles and their surroundings, and Pairs of Pixels capture the structural characteristics of objects. A powerful feature selection analysis based on Partial Least Squares is employed to deal with the resulting high-dimensional feature space (almost 70,000 dimensions). We also propose an Incremental Multiple Kernel Learning (IMKL) scheme to detect vehicles in a traffic surveillance scenario. Obtaining task- and scene-specific datasets of visual categories is far more tedious than obtaining a generic dataset of the same classes; our IMKL approach therefore initializes on a generic training database and then tunes itself to the classification task at hand.
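To make the Partial Least Squares step concrete, here is a minimal sketch of PLS as a supervised dimensionality-reduction stage ahead of a vehicle/background classifier. The feature matrix is a random stand-in for the roughly 70,000-dimensional descriptors, and scikit-learn's PLSRegression is our choice of implementation, not necessarily the one used in the work.

```python
# Minimal sketch: Partial Least Squares as a supervised dimensionality
# reducer before a lightweight classifier. The features are synthetic
# stand-ins; the paper's Color Probability Map / Pairs-of-Pixels
# descriptors are not reproduced here.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_features, n_components = 200, 70_000, 20

X = rng.standard_normal((n_samples, n_features))  # stand-in descriptors
y = rng.integers(0, 2, n_samples)                 # 1 = vehicle, 0 = background

# PLS projects X onto a few latent directions chosen to maximize
# covariance with the labels, unlike PCA, which ignores y entirely.
pls = PLSRegression(n_components=n_components)
pls.fit(X, y)
X_latent = pls.transform(X)                       # (n_samples, n_components)

# Any lightweight classifier can then operate in the latent space.
clf = LogisticRegression().fit(X_latent, y)
print("train accuracy:", clf.score(X_latent, y))
```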
Second, we develop a video understanding system for scene elements, such as bus stops, crosswalks, and intersections, that are characterized more by qualitative activities and geometry than by intrinsic appearance. The domain models for scene elements are not learned from a corpus of video but are instead elicited from humans and represented as probabilistic logic rules within a Markov Logic Network framework. Human-elicited models, however, describe object interactions as they occur in the 3D world rather than their projected appearance in some specific 2D image plane. We bridge this gap by recovering qualitative scene geometry to analyze object interactions in the 3D world, and then reason about scene geometry, occlusions, and common-sense domain knowledge using a set of meta-rules.
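The Markov Logic Network machinery behind such rules assigns each possible world a probability proportional to the exponentiated sum of the weights of satisfied rules. The sketch below is a toy propositional version with made-up bus-stop predicates and weights; it illustrates the inference principle, not the paper's actual rule base.

```python
# Toy propositional MLN: P(world) is proportional to exp(sum of weights of
# satisfied rules). Predicates, rules, and weights are illustrative
# stand-ins for human-elicited scene-element rules.
import itertools
import math

ATOMS = ["person_waits", "bus_stops", "is_bus_stop"]

# (weight, rule as a function of a world); higher weight = stronger rule.
RULES = [
    (1.5, lambda w: (not w["person_waits"]) or w["is_bus_stop"]),  # waiting => bus stop
    (2.0, lambda w: (not w["bus_stops"]) or w["is_bus_stop"]),     # bus stopping => bus stop
    (0.5, lambda w: not w["is_bus_stop"]),                         # weak prior against bus stops
]

def unnormalized(world):
    return math.exp(sum(wt for wt, rule in RULES if rule(world)))

worlds = [dict(zip(ATOMS, vals))
          for vals in itertools.product([False, True], repeat=len(ATOMS))]

# Condition on evidence by restricting the worlds we sum over.
evidence = [w for w in worlds if w["person_waits"]]
p = (sum(unnormalized(w) for w in evidence if w["is_bus_stop"])
     / sum(unnormalized(w) for w in evidence))
print(f"P(is_bus_stop | person_waits) = {p:.3f}")
```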
Visual Programming: Compositional visual reasoning without training
We present VISPROG, a neuro-symbolic approach to solving complex and compositional visual tasks given natural language instructions. VISPROG avoids the need for any task-specific training. Instead, it uses the in-context learning ability of large language models to generate Python-like modular programs, which are then executed to obtain both the solution and a comprehensive, interpretable rationale. Each line of the generated program may invoke one of several off-the-shelf computer vision models, image processing routines, or Python functions to produce intermediate outputs that may be consumed by subsequent parts of the program. We demonstrate the flexibility of VISPROG on four diverse tasks: compositional visual question answering, zero-shot reasoning on image pairs, factual knowledge object tagging, and language-guided image editing. We believe neuro-symbolic approaches like VISPROG are an exciting avenue for easily and effectively expanding the scope of AI systems to serve the long tail of complex tasks that people may wish to perform.
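To illustrate the execution model, here is a minimal sketch of a VISPROG-style interpreter: a registry of modules and a loop that runs a generated program line by line, binding each output to a named variable and recording a step-by-step trace. The modules, regex-based parsing, and example program are illustrative stand-ins, not the actual VISPROG implementation.

```python
# Sketch of a VISPROG-style interpreter. Each program line calls a
# registered module and binds the result to a variable that later lines
# can consume; the trace doubles as an interpretable rationale.
import re

# Module registry: in the real system these wrap detectors, VQA models,
# image editors, etc.; here they are trivial stand-ins.
MODULES = {
    "LOC":   lambda image, object: [(10, 10, 50, 50)],  # fake "dog" box
    "COUNT": lambda boxes: len(boxes),
}

LINE = re.compile(r"(\w+)\s*=\s*(\w+)\((.*)\)")

def execute(program, env):
    """Run a generated program line by line, keeping a visible trace."""
    trace = []
    for line in program.strip().splitlines():
        out, module, arg_str = LINE.match(line.strip()).groups()
        kwargs = {}
        for arg in arg_str.split(","):
            k, v = (s.strip() for s in arg.split("="))
            kwargs[k] = env[v] if v in env else eval(v)  # variable or literal
        env[out] = MODULES[module](**kwargs)
        trace.append((line.strip(), env[out]))
    return env[out], trace

# The kind of program an LLM might emit for "How many dogs are there?"
program = """
BOXES = LOC(image=IMG, object='dog')
ANSWER = COUNT(boxes=BOXES)
"""
answer, trace = execute(program, {"IMG": "<image placeholder>"})
print(answer)                  # 1
for step, value in trace:      # the interpretable rationale
    print(step, "->", value)
```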
Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering
A number of studies have found that today's Visual Question Answering (VQA)
models are heavily driven by superficial correlations in the training data and
lack sufficient image grounding. To encourage development of models geared
towards the latter, we propose a new setting for VQA where for every question
type, train and test sets have different prior distributions of answers.
Specifically, we present new splits of the VQA v1 and VQA v2 datasets, which we
call Visual Question Answering under Changing Priors (VQA-CP v1 and VQA-CP v2
respectively). First, we evaluate several existing VQA models under this new
setting and show that their performance degrades significantly compared to the
original VQA setting. Second, we propose a novel Grounded Visual Question
Answering model (GVQA) that contains inductive biases and restrictions in the
architecture specifically designed to prevent the model from 'cheating' by
primarily relying on priors in the training data. Specifically, GVQA explicitly
disentangles the recognition of visual concepts present in the image from the
identification of the plausible answer space for a given question, enabling the
model to more robustly generalize across different distributions of answers.
GVQA is built off an existing VQA model -- Stacked Attention Networks (SAN).
Our experiments demonstrate that GVQA significantly outperforms SAN on both
VQA-CP v1 and VQA-CP v2 datasets. Interestingly, it also outperforms more
powerful VQA models such as Multimodal Compact Bilinear Pooling (MCB) in
several cases. GVQA offers strengths complementary to SAN when trained and
evaluated on the original VQA v1 and VQA v2 datasets. Finally, GVQA is more
transparent and interpretable than existing VQA models.

Comment: 15 pages, 10 figures. To appear in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
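As a sketch of how such changing-priors splits can be constructed, the snippet below assigns each (question type, answer) group wholly to one side, so the dominant answers per question type differ between train and test. This is an illustrative construction, not necessarily the exact VQA-CP procedure.

```python
# Illustrative changing-priors split: every (question_type, answer) group
# lands entirely in train or test, forcing different per-type answer
# priors across the two splits.
from collections import Counter, defaultdict

def changing_priors_split(examples, test_fraction=0.3):
    """examples: dicts with 'question_type' and 'answer' keys."""
    groups = defaultdict(list)
    for ex in examples:
        groups[(ex["question_type"], ex["answer"])].append(ex)

    train, test = [], []
    # Visit larger groups first; send a group to test only while test is
    # below its quota, so each side gets different dominant answers.
    for _, group in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        placed = len(train) + len(test)
        if len(test) < test_fraction * (placed + len(group)):
            test.extend(group)
        else:
            train.extend(group)
    return train, test

examples = (
    [{"question_type": "what color", "answer": "red"}] * 6
    + [{"question_type": "what color", "answer": "blue"}] * 3
    + [{"question_type": "how many", "answer": "2"}] * 5
    + [{"question_type": "how many", "answer": "4"}] * 2
)
train, test = changing_priors_split(examples)
for name, split in (("train", train), ("test", test)):
    print(name, Counter((ex["question_type"], ex["answer"]) for ex in split))
```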
ELASTIC: Improving CNNs with Dynamic Scaling Policies
Scale variation has been a challenge from traditional to modern approaches in
computer vision. Most solutions to scale issues have a similar theme: a set of
intuitive and manually designed policies that are generic and fixed (e.g. SIFT
or feature pyramid). We argue that the scaling policy should be learned from
data. In this paper, we introduce ELASTIC, a simple, efficient and yet very
effective approach to learn a dynamic scale policy from data. We formulate the
scaling policy as a non-linear function inside the network's structure that (a)
is learned from data, (b) is instance specific, (c) does not add extra
computation, and (d) can be applied to any network architecture. We applied ELASTIC to several state-of-the-art network architectures and showed consistent improvement, with no extra (and sometimes even lower) computation, on ImageNet
classification, MSCOCO multi-label classification, and PASCAL VOC semantic
segmentation. Our results show major improvement for images with scale
challenges. Our code is available here: https://github.com/allenai/elastic

Comment: CVPR 2019 oral; code available at https://github.com/allenai/elastic
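As one reading of the idea, the PyTorch sketch below shows a block with parallel full-resolution and downsampled paths whose learned outputs are merged, so the effective scaling policy is driven by data rather than a fixed hand-designed pyramid. The block design here is illustrative only; the actual ELASTIC blocks are specified in the paper and repository.

```python
# Illustrative ELASTIC-flavoured block: a parallel branch processes a
# downsampled copy of the features (cheaper: a quarter of the pixels),
# and its upsampled output is merged with the full-resolution path.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticLikeBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.full = nn.Conv2d(channels, channels, 3, padding=1)
        self.down = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        # Full-resolution path.
        hi = self.full(x)
        # Low-resolution path: downsample, convolve, upsample back so
        # the two paths can be summed.
        lo = F.avg_pool2d(x, 2)
        lo = self.down(lo)
        lo = F.interpolate(lo, size=x.shape[-2:], mode="bilinear",
                           align_corners=False)
        # Both paths are learned, so how much each scale contributes is
        # data-driven and varies per instance through the features.
        return F.relu(hi + lo)

x = torch.randn(1, 64, 56, 56)
print(ElasticLikeBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```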
I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision
Many high-level skills that are required for computer vision tasks, such as
parsing questions, comparing and contrasting semantics, and writing
descriptions, are also required in other domains such as natural language
processing. In this paper, we ask whether it is possible to learn those skills
from text data and then transfer them to vision tasks without ever training on visual data. Key to our approach is exploiting the joint embedding
space of contrastively trained vision and language encoders. In practice, there
can be systematic differences between embedding spaces for different modalities
in contrastive models, and we analyze how these differences affect our approach
and study strategies to mitigate this concern. We produce models using only
text training data on four representative tasks: image captioning, visual entailment, visual question answering, and visual news captioning, and evaluate
them on standard benchmarks using images. We find these models perform close to
models trained on images, while surpassing prior work for captioning and visual
entailment in this text-only setting by over 9 points, and outperforming all
prior work on visual news by over 30 points. We also showcase a variety of
stylistic image captioning models that are trained using no image data and no
human-curated language data, but instead using readily-available text data from
books, the web, or language models.

Comment: website (https://prior.allenai.org/projects/close), code (https://github.com/allenai/close)
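The core recipe can be sketched as follows: train a task head on text embeddings from a frozen contrastive vision-language model, then feed it image embeddings from the same joint space at test time. In the sketch below, the embeddings are random stand-ins for a CLIP-style encoder's outputs, Gaussian noise on the text embeddings is shown as one strategy for softening the modality gap the abstract mentions, and all hyperparameters are illustrative.

```python
# Sketch: train on text embeddings only, infer on image embeddings from
# the same contrastive joint space. Embeddings here are random stand-ins
# for a frozen CLIP-style encoder's outputs; noise injection is one way
# to mitigate the systematic text/image embedding gap.
import torch
import torch.nn as nn

EMB_DIM, NUM_CLASSES, NOISE_STD = 512, 3, 0.08

head = nn.Linear(EMB_DIM, NUM_CLASSES)  # task head, trained on text only
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def normalize(z):
    return z / z.norm(dim=-1, keepdim=True)

# --- training: text embeddings only (stand-ins for encode_text outputs) ---
text_emb = normalize(torch.randn(256, EMB_DIM))
labels = torch.randint(0, NUM_CLASSES, (256,))
for _ in range(100):
    noisy = normalize(text_emb + NOISE_STD * torch.randn_like(text_emb))
    loss = loss_fn(head(noisy), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

# --- inference: image embeddings from the same joint space ---
image_emb = normalize(torch.randn(4, EMB_DIM))  # stand-in for encode_image
print(head(image_emb).argmax(dim=-1))
```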